knitr::opts_chunk$set(echo = TRUE, message = FALSE)
#load required packages
library(tidyverse)
library(dplyr)
library(ggplot2)
library(plotly)
library(rworldmap)
For the purpose of this project, a Netflix dataset sourced from Kaggle has been used.
df <- read.csv("netflix_titles.csv", stringsAsFactors = FALSE) # read data
summary(df) # summary statistics
## show_id type title director
## Length:7787 Length:7787 Length:7787 Length:7787
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## cast country date_added release_year
## Length:7787 Length:7787 Length:7787 Min. :1925
## Class :character Class :character Class :character 1st Qu.:2013
## Mode :character Mode :character Mode :character Median :2017
## Mean :2014
## 3rd Qu.:2018
## Max. :2021
## rating duration listed_in description
## Length:7787 Length:7787 Length:7787 Length:7787
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
length(df$release_year) # number of observations
## [1] 7787
UniqueValue = function (x) {length(unique(x))} # shows number of unique values in each column
apply(df, 2, UniqueValue) # number of unique values in each column
## show_id type title director cast country
## 7787 2 7787 4050 6832 682
## date_added release_year rating duration listed_in description
## 1566 73 15 216 492 7769
sapply(df, function(x) sum(x == "")) # number of blank ("") values per column
## show_id type title director cast country
## 0 0 0 2389 718 507
## date_added release_year rating duration listed_in description
## 10 0 7 0 0 0
The original dataset consists of 7,787 rows and 12 columns. Every column except ‘show_id’ and ‘title’ has fewer than 7,787 unique values, which means values repeat within those columns. In addition, blank ("") values were found in the ‘director’, ‘cast’, ‘country’, ‘date_added’, and ‘rating’ columns.
Removing every row with a blank value at once would cause major data loss. To avoid biasing the data, pre-processing is kept to a minimum: only the columns used in a given analysis are cleaned.
In this section, only the relevant columns are pre-processed and then used to create visualizations. The pre-processing method is discussed under each analysis topic.
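As a rough sketch of the checks above, the snippet below counts missing and unique values per column on a small made-up data frame (`toy` and `count_missing` are illustrative names, not part of the project). It treats both NA and whitespace-only strings as missing, which is an assumption that goes slightly beyond the `x == ""` test used earlier.

```r
# Toy data frame standing in for netflix_titles.csv (values are made up)
toy <- data.frame(
  show_id  = c("s1", "s2", "s3", "s4"),
  director = c("A. Lee", "", NA, "A. Lee"),
  country  = c("India", "", "India", "Spain"),
  stringsAsFactors = FALSE
)

# Count entries that are NA or blank after trimming whitespace
count_missing <- function(x) sum(is.na(x) | trimws(x) == "")

missing_per_column <- vapply(toy, count_missing, integer(1))
unique_per_column  <- vapply(toy, function(x) length(unique(x)), integer(1))

print(missing_per_column)
print(unique_per_column)
```

Using `vapply` column by column avoids the coercion to a character matrix that `apply(df, 2, ...)` performs, which matters once a data frame mixes numeric and character columns.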
This section investigates how much content Netflix adds each year, regardless of when the content was released. Note that a movie released in 2017 can be added to Netflix in 2020. How many movies and TV shows does Netflix add to its platform each year? Does Netflix treat movies and TV shows differently?
The ‘date_added’ column was originally a character column in ‘Month Day, Year’ format. It has therefore been reduced to the four-digit year (still stored as character) for the yearly trend analysis. Rows with a missing ‘date_added’ value have also been removed for this analysis.
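A minimal sketch of the year extraction described above, run on a few sample strings. The integer conversion at the end is an optional extra step that this project does not perform, and the variable names are illustrative.

```r
# Sample strings in "Month Day, Year" format; the blank stands in for a missing date
date_added <- c("August 14, 2020", " December 23, 2016", "")

# Keep only the trailing four digits; strings without a match are returned unchanged
year_chr <- sub(".*(\\d{4})$", "\\1", trimws(date_added))

# Turn blanks into NA so they can be dropped with !is.na(), then convert to integer
year_chr[year_chr == ""] <- NA
year_int <- as.integer(year_chr)

print(year_int)
```

Storing the year as an integer would let plots treat it as a numeric axis, whereas the character version used in this project is treated as a discrete category.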
# To investigate yearly trends, I need to eliminate Month and Day data from 'date_added' column
head(df$date_added,5) # this column is originally in 'Month Day, Year' format
## [1] "August 14, 2020" "December 23, 2016" "December 20, 2018"
## [4] "November 16, 2017" "January 1, 2020"
class(df$date_added) # column class: character
## [1] "character"
df$date_added <- sub(".*(\\d{4}).*$", "\\1", df$date_added) # keep only the four-digit year
head(df$date_added,5) # this column now only has year data
## [1] "2020" "2016" "2018" "2017" "2020"
# total number of movies and TV shows added to Netflix by year
year <- df %>%
filter(!(date_added == "")) %>% # remove missing values
group_by(date_added) %>%
summarise(count = n()) # number of movies and TV shows by year added to Netflix
year
## # A tibble: 14 x 2
## date_added count
## <chr> <int>
## 1 2008 2
## 2 2009 2
## 3 2010 1
## 4 2011 13
## 5 2012 3
## 6 2013 11
## 7 2014 25
## 8 2015 88
## 9 2016 443
## 10 2017 1225
## 11 2018 1685
## 12 2019 2153
## 13 2020 2009
## 14 2021 117
# draw a bar plot for number of contents added to Netflix by year
bar <- ggplot(year, aes(x = date_added, y = count, fill = date_added)) +
geom_bar(stat = "identity") +
geom_text(aes(label = count), vjust = -0.3) +
labs(x = "Year", fill = "Year", title = "Number of Contents added to Netflix by Year") +
theme(plot.title = element_text(hjust = 0.5)) # center title
bar
# Netflix trends - movie vs TV shows
group <- df %>%
filter(!(date_added == "")) %>%
group_by(type, date_added) %>% # group by content type and year added
summarise(count = n())
group
## # A tibble: 24 x 3
## # Groups: type [2]
## type date_added count
## <chr> <chr> <int>
## 1 Movie 2008 1
## 2 Movie 2009 2
## 3 Movie 2010 1
## 4 Movie 2011 13
## 5 Movie 2012 3
## 6 Movie 2013 6
## 7 Movie 2014 19
## 8 Movie 2015 58
## 9 Movie 2016 258
## 10 Movie 2017 864
## # … with 14 more rows
# create bar plot for number of contents added by year by content type
bar <- ggplot(group, aes(x = date_added, y = count, fill = type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Netflix Trends by Year", x = "Year", fill = "Type") +
theme(plot.title = element_text(hjust = 0.5)) # center title
bar
The first bar plot shows the total number of movies and TV shows added to Netflix by year. After years of adding more content each year, Netflix added fewer titles in 2020 than in 2019 (2,009 versus 2,153). The 2021 bar is much lower, likely because the data covers only the start of that year.
The second chart compares the trend for movies with that for TV shows. Although movies added to Netflix outnumber TV shows, the number of movies added began to decrease in 2020, whereas the number of TV shows added continued to increase. This may indicate that Netflix is focusing more on TV shows.
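To make the 2020 dip concrete, the year-over-year change can be computed from the totals printed in the yearly table above (2016 to 2020). This is only an illustrative calculation; 2021 is left out on the assumption that the data covers just the start of that year.

```r
# Totals copied from the yearly table above
yearly <- data.frame(
  year  = 2016:2020,
  count = c(443, 1225, 1685, 2153, 2009)
)

# Absolute and percentage change relative to the previous year
previous <- c(NA, head(yearly$count, -1))
yearly$change     <- yearly$count - previous
yearly$pct_change <- round(100 * yearly$change / previous, 1)

print(yearly)
```

The last row shows 144 fewer titles in 2020 than in 2019, a drop of roughly 6.7 percent.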
This section shows the total number of movies and TV shows on Netflix.
# create data frame for number of contents by type
type <- df %>%
group_by(type) %>%
summarise(count = n())
type
## # A tibble: 2 x 2
## type count
## <chr> <int>
## 1 Movie 5377
## 2 TV Show 2410
#all plots created afterwards will have title at center
#theme_update(plot.title = element_text(hjust = 0.5))
# create bar plot for Netflix contents by type
bar <- ggplot(type, aes(x = type, y = count, fill = type)) +
geom_bar(stat = "identity") +
geom_text(aes(label = count), vjust = -0.3) +
labs(title = "Netflix contents by type", x = "Type", fill = "Type") + # add title
theme(plot.title = element_text(hjust = 0.5)) # center title
bar
# create data frames each for movies and TV shows for further analysis
movie <- df[(df$type == "Movie"),]
tv <- df[(df$type == "TV Show"),]
As shown in the chart above, the number of movies on Netflix is more than double the number of TV shows.
For further analysis of each content type, the dataset has been split into two separate datasets based on the ‘type’ column.
The remainder of this project examines movies on Netflix, using the ‘movie’ dataset created above.
To identify the top 10 countries by number of movies, the ‘country’ column has been pre-processed. Some movies list multiple countries of origin; for example, the movie ‘100 meters’ has ‘Portugal, Spain’ in its ‘country’ column. This creates too many unique values in the ‘country’ column. Therefore, I split each value at the commas and removed the leftover commas and unnecessary blank spaces. The result is a clean list of country names.
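The split-and-trim idea described above can be sketched on a toy vector. This variant splits on the bare comma and then trims whitespace, which should also handle trailing commas in one pass; `country_raw` is made-up sample data.

```r
# Made-up values covering the cases seen in the data:
# several countries, a trailing comma, and a blank entry
country_raw <- c("Portugal, Spain", "United States,", "", "Norway, Iceland, United States")

# Split at every comma, trim surrounding whitespace, and drop empty pieces
parts <- unlist(strsplit(country_raw, ",", fixed = TRUE))
parts <- trimws(parts)
parts <- parts[parts != ""]

print(parts)
```

Splitting on `","` rather than `", "` means no separate `gsub()` pass is needed to remove leftover commas.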
# basic information about movies on Netflix
length(unique(movie$country)) # many unique values, because each multi-country entry is a distinct string
## [1] 591
head(movie$country, 50) # some movies list multiple countries separated by commas
## [1] "Mexico" "Singapore"
## [3] "United States" "United States"
## [5] "Egypt" "United States"
## [7] "India" "India"
## [9] "United States" "Thailand"
## [11] "United States" "Nigeria"
## [13] "Norway, Iceland, United States" "India"
## [15] "United Kingdom" "India"
## [17] "India" "India"
## [19] "India" "United States"
## [21] "South Korea" "Italy"
## [23] "Canada" "Indonesia"
## [25] "Indonesia" "United States"
## [27] "Canada" "United States"
## [29] "Romania" "Romania"
## [31] "Spain" "Turkey"
## [33] "Iceland" "Turkey"
## [35] "Nigeria" "United States"
## [37] "United States" "United States"
## [39] "South Africa, Nigeria" "France"
## [41] "United States, South Africa" "Spain"
## [43] "Portugal, Spain" "United States"
## [45] "United States" "Indonesia"
## [47] "India" "United States"
## [49] "United States" "United States"
# split comma separated country names
country <- unlist(strsplit(movie$country, ", ")) # split country names at comma
unique(country) # some country names have a trailing comma
## [1] "Mexico" "Singapore" "United States"
## [4] "Egypt" "India" "Thailand"
## [7] "Nigeria" "Norway" "Iceland"
## [10] "United Kingdom" "South Korea" "Italy"
## [13] "Canada" "Indonesia" "Romania"
## [16] "Spain" "Turkey" "South Africa"
## [19] "France" "Portugal" "Hong Kong"
## [22] "China" "Germany" "Argentina"
## [25] "Serbia" "Denmark" "Poland"
## [28] "Japan" "Kenya" "New Zealand"
## [31] "Pakistan" "Australia" "Taiwan"
## [34] "Netherlands" "Philippines" "United Arab Emirates"
## [37] "Brazil" "Iran" "Belgium"
## [40] "Israel" "Uruguay" "Bulgaria"
## [43] "Chile" "Colombia" "Algeria"
## [46] "Soviet Union" "Sweden" "Malaysia"
## [49] "Ireland" "Luxembourg" "Austria"
## [52] "Peru" "Senegal" "Switzerland"
## [55] "Ghana" "Saudi Arabia" "Armenia"
## [58] "Jordan" "Mongolia" "Namibia"
## [61] "Finland" "Lebanon" "Qatar"
## [64] "Vietnam" "Russia" "Malta"
## [67] "Kuwait" "Czech Republic" "Bahamas"
## [70] "Sri Lanka" "Cayman Islands" "Bangladesh"
## [73] "United States," "Zimbabwe" "Hungary"
## [76] "Latvia" "Liechtenstein" "Venezuela"
## [79] "Morocco" "Cambodia" "Albania"
## [82] "Nicaragua" "Greece" "Cambodia,"
## [85] "Croatia" "Guatemala" "West Germany"
## [88] "Poland," "Slovenia" "Dominican Republic"
## [91] "Nepal" "Samoa" "Bermuda"
## [94] "Ecuador" "Georgia" "Botswana"
## [97] "Iraq" "Vatican City" "Angola"
## [100] "Jamaica" "Kazakhstan" "Malawi"
## [103] "Slovakia" "Lithuania" "Afghanistan"
## [106] "Paraguay" "Somalia" "Sudan"
## [109] "Panama" "United Kingdom," "Uganda"
## [112] "East Germany" "Ukraine" "Montenegro"
# remove leftover commas from country names
country <- gsub(",","", country)
unique(country)
## [1] "Mexico" "Singapore" "United States"
## [4] "Egypt" "India" "Thailand"
## [7] "Nigeria" "Norway" "Iceland"
## [10] "United Kingdom" "South Korea" "Italy"
## [13] "Canada" "Indonesia" "Romania"
## [16] "Spain" "Turkey" "South Africa"
## [19] "France" "Portugal" "Hong Kong"
## [22] "China" "Germany" "Argentina"
## [25] "Serbia" "Denmark" "Poland"
## [28] "Japan" "Kenya" "New Zealand"
## [31] "Pakistan" "Australia" "Taiwan"
## [34] "Netherlands" "Philippines" "United Arab Emirates"
## [37] "Brazil" "Iran" "Belgium"
## [40] "Israel" "Uruguay" "Bulgaria"
## [43] "Chile" "Colombia" "Algeria"
## [46] "Soviet Union" "Sweden" "Malaysia"
## [49] "Ireland" "Luxembourg" "Austria"
## [52] "Peru" "Senegal" "Switzerland"
## [55] "Ghana" "Saudi Arabia" "Armenia"
## [58] "Jordan" "Mongolia" "Namibia"
## [61] "Finland" "Lebanon" "Qatar"
## [64] "Vietnam" "Russia" "Malta"
## [67] "Kuwait" "Czech Republic" "Bahamas"
## [70] "Sri Lanka" "Cayman Islands" "Bangladesh"
## [73] "Zimbabwe" "Hungary" "Latvia"
## [76] "Liechtenstein" "Venezuela" "Morocco"
## [79] "Cambodia" "Albania" "Nicaragua"
## [82] "Greece" "Croatia" "Guatemala"
## [85] "West Germany" "Slovenia" "Dominican Republic"
## [88] "Nepal" "Samoa" "Bermuda"
## [91] "Ecuador" "Georgia" "Botswana"
## [94] "Iraq" "Vatican City" "Angola"
## [97] "Jamaica" "Kazakhstan" "Malawi"
## [100] "Slovakia" "Lithuania" "Afghanistan"
## [103] "Paraguay" "Somalia" "Sudan"
## [106] "Panama" "Uganda" "East Germany"
## [109] "Ukraine" "Montenegro"
# store the list as data frame
movie_country <- data.frame(country = country)
unique(movie_country$country) # it now has clean country names
## [1] "Mexico" "Singapore" "United States"
## [4] "Egypt" "India" "Thailand"
## [7] "Nigeria" "Norway" "Iceland"
## [10] "United Kingdom" "South Korea" "Italy"
## [13] "Canada" "Indonesia" "Romania"
## [16] "Spain" "Turkey" "South Africa"
## [19] "France" "Portugal" "Hong Kong"
## [22] "China" "Germany" "Argentina"
## [25] "Serbia" "Denmark" "Poland"
## [28] "Japan" "Kenya" "New Zealand"
## [31] "Pakistan" "Australia" "Taiwan"
## [34] "Netherlands" "Philippines" "United Arab Emirates"
## [37] "Brazil" "Iran" "Belgium"
## [40] "Israel" "Uruguay" "Bulgaria"
## [43] "Chile" "Colombia" "Algeria"
## [46] "Soviet Union" "Sweden" "Malaysia"
## [49] "Ireland" "Luxembourg" "Austria"
## [52] "Peru" "Senegal" "Switzerland"
## [55] "Ghana" "Saudi Arabia" "Armenia"
## [58] "Jordan" "Mongolia" "Namibia"
## [61] "Finland" "Lebanon" "Qatar"
## [64] "Vietnam" "Russia" "Malta"
## [67] "Kuwait" "Czech Republic" "Bahamas"
## [70] "Sri Lanka" "Cayman Islands" "Bangladesh"
## [73] "Zimbabwe" "Hungary" "Latvia"
## [76] "Liechtenstein" "Venezuela" "Morocco"
## [79] "Cambodia" "Albania" "Nicaragua"
## [82] "Greece" "Croatia" "Guatemala"
## [85] "West Germany" "Slovenia" "Dominican Republic"
## [88] "Nepal" "Samoa" "Bermuda"
## [91] "Ecuador" "Georgia" "Botswana"
## [94] "Iraq" "Vatican City" "Angola"
## [97] "Jamaica" "Kazakhstan" "Malawi"
## [100] "Slovakia" "Lithuania" "Afghanistan"
## [103] "Paraguay" "Somalia" "Sudan"
## [106] "Panama" "Uganda" "East Germany"
## [109] "Ukraine" "Montenegro"
# count country names appearing in the original dataset
movie_country <- movie_country %>%
group_by(country) %>%
summarise(count = n())
head(movie_country,10)
## # A tibble: 10 x 2
## country count
## <chr> <int>
## 1 Afghanistan 1
## 2 Albania 1
## 3 Algeria 2
## 4 Angola 1
## 5 Argentina 64
## 6 Armenia 1
## 7 Australia 84
## 8 Austria 10
## 9 Bahamas 1
## 10 Bangladesh 3
# top 10 countries by number of movies added to Netflix
bar <- movie_country %>%
arrange(desc(count)) %>% # most to least
slice(1:10) %>% # top 10 countries
ggplot(., aes(x=reorder(country,-count), y = count)) + # bar plot most to least
geom_bar(stat='identity') +
theme_classic() +
labs(x = "Country", title = "Top 10 Countries by number of movies on Netflix") +
geom_text(aes(label = count), vjust = -0.3) # add count labels
bar
The chart above shows the top 10 countries by number of movies on Netflix. According to this chart, the United States has the most movies on Netflix.
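One caveat worth keeping in mind: a movie that lists several countries is counted once for each of them, so the per-country counts add up to more than the number of movies. A small sketch with made-up data (all names are illustrative):

```r
# Three made-up titles; the first one lists two countries
country_raw <- c("Portugal, Spain", "India", "India")

# One entry per country mention
mentions <- trimws(unlist(strsplit(country_raw, ",", fixed = TRUE)))

# Count mentions per country and keep the two most frequent
per_country <- sort(table(mentions), decreasing = TRUE)
top2 <- head(per_country, 2)

n_titles   <- length(country_raw)  # number of movies
n_mentions <- sum(per_country)     # number of country mentions

print(top2)
print(c(titles = n_titles, mentions = n_mentions))
```

Here three titles produce four country mentions, so the bars in a chart like the one above should be read as "movies involving this country" rather than as a partition of the catalogue.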
spdf <- joinCountryData2Map(movie_country, joinCode="NAME", nameJoinColumn="country")
mapParams <- mapCountryData(spdf,
nameColumnToPlot="count",
catMethod=c(0,1,3,5,10,50,100,300,500,1000,2500),
mapTitle = "Number of Movies added to Netflix",
addLegend = FALSE)
do.call(addMapLegend, c(mapParams, legendLabels="all", legendWidth=0.5))
A world map has been created to visualize Netflix’s global trend. By continent, the Americas (North and South America) show a high number of Netflix movies overall, while Africa shows the fewest.
This section studies Netflix’s preference for country of origin for movies by plotting a line graph of the number of movies from 5 countries by year. Note that the time variable is the date each movie was added to Netflix. This study says nothing about the volume of movie production in each country.
genre_country <- movie %>%
mutate(listed_in = strsplit(as.character(listed_in), ", ")) %>% # separate genre by commas
unnest(listed_in) %>%
mutate(country = strsplit(as.character(country), ", ")) %>% # separate country name by commas
unnest(country) %>%
select("title","country","date_added","listed_in")
genre_country
## # A tibble: 14,342 x 4
## title country date_added listed_in
## <chr> <chr> <chr> <chr>
## 1 7:19 Mexico 2016 Dramas
## 2 7:19 Mexico 2016 International Movies
## 3 23:59 Singapore 2018 Horror Movies
## 4 23:59 Singapore 2018 International Movies
## 5 9 United States 2017 Action & Adventure
## 6 9 United States 2017 Independent Movies
## 7 9 United States 2017 Sci-Fi & Fantasy
## 8 21 United States 2020 Dramas
## 9 122 Egypt 2020 Horror Movies
## 10 122 Egypt 2020 International Movies
## # … with 14,332 more rows
unique(genre_country$country) # some country names still have commas
## [1] "Mexico" "Singapore" "United States"
## [4] "Egypt" "India" "Thailand"
## [7] "Nigeria" "Norway" "Iceland"
## [10] "United Kingdom" "South Korea" "Italy"
## [13] "Canada" "Indonesia" "Romania"
## [16] "Spain" "Turkey" "South Africa"
## [19] "France" "Portugal" "Hong Kong"
## [22] "China" "Germany" "Argentina"
## [25] "Serbia" "Denmark" "Poland"
## [28] "Japan" "Kenya" "New Zealand"
## [31] "Pakistan" "Australia" "Taiwan"
## [34] "Netherlands" "Philippines" "United Arab Emirates"
## [37] "Brazil" "Iran" "Belgium"
## [40] "Israel" "Uruguay" "Bulgaria"
## [43] "Chile" "Colombia" "Algeria"
## [46] "Soviet Union" "Sweden" "Malaysia"
## [49] "Ireland" "Luxembourg" "Austria"
## [52] "Peru" "Senegal" "Switzerland"
## [55] "Ghana" "Saudi Arabia" "Armenia"
## [58] "Jordan" "Mongolia" "Namibia"
## [61] "Finland" "Lebanon" "Qatar"
## [64] "Vietnam" "Russia" "Malta"
## [67] "Kuwait" "Czech Republic" "Bahamas"
## [70] "Sri Lanka" "Cayman Islands" "Bangladesh"
## [73] "United States," "Zimbabwe" "Hungary"
## [76] "Latvia" "Liechtenstein" "Venezuela"
## [79] "Morocco" "Cambodia" "Albania"
## [82] "Nicaragua" "Greece" "Cambodia,"
## [85] "Croatia" "Guatemala" "West Germany"
## [88] "Poland," "Slovenia" "Dominican Republic"
## [91] "Nepal" "Samoa" "Bermuda"
## [94] "Ecuador" "Georgia" "Botswana"
## [97] "Iraq" "Vatican City" "Angola"
## [100] "Jamaica" "Kazakhstan" "Malawi"
## [103] "Slovakia" "Lithuania" "Afghanistan"
## [106] "Paraguay" "Somalia" "Sudan"
## [109] "Panama" "United Kingdom," "Uganda"
## [112] "East Germany" "Ukraine" "Montenegro"
length(unique(genre_country$country)) # number of unique country names
## [1] 114
unique(genre_country$listed_in) # clean; no leftover commas to be deleted
## [1] "Dramas" "International Movies"
## [3] "Horror Movies" "Action & Adventure"
## [5] "Independent Movies" "Sci-Fi & Fantasy"
## [7] "Thrillers" "Documentaries"
## [9] "Sports Movies" "Comedies"
## [11] "Romantic Movies" "Movies"
## [13] "Music & Musicals" "LGBTQ Movies"
## [15] "Faith & Spirituality" "Children & Family Movies"
## [17] "Classic Movies" "Cult Movies"
## [19] "Stand-Up Comedy" "Anime Features"
genre_country <- genre_country %>%
mutate(country = gsub(",","",country)) # remove leftover commas
genre_country
## # A tibble: 14,342 x 4
## title country date_added listed_in
## <chr> <chr> <chr> <chr>
## 1 7:19 Mexico 2016 Dramas
## 2 7:19 Mexico 2016 International Movies
## 3 23:59 Singapore 2018 Horror Movies
## 4 23:59 Singapore 2018 International Movies
## 5 9 United States 2017 Action & Adventure
## 6 9 United States 2017 Independent Movies
## 7 9 United States 2017 Sci-Fi & Fantasy
## 8 21 United States 2020 Dramas
## 9 122 Egypt 2020 Horror Movies
## 10 122 Egypt 2020 International Movies
## # … with 14,332 more rows
unique(genre_country$country) # commas removed from country names
## [1] "Mexico" "Singapore" "United States"
## [4] "Egypt" "India" "Thailand"
## [7] "Nigeria" "Norway" "Iceland"
## [10] "United Kingdom" "South Korea" "Italy"
## [13] "Canada" "Indonesia" "Romania"
## [16] "Spain" "Turkey" "South Africa"
## [19] "France" "Portugal" "Hong Kong"
## [22] "China" "Germany" "Argentina"
## [25] "Serbia" "Denmark" "Poland"
## [28] "Japan" "Kenya" "New Zealand"
## [31] "Pakistan" "Australia" "Taiwan"
## [34] "Netherlands" "Philippines" "United Arab Emirates"
## [37] "Brazil" "Iran" "Belgium"
## [40] "Israel" "Uruguay" "Bulgaria"
## [43] "Chile" "Colombia" "Algeria"
## [46] "Soviet Union" "Sweden" "Malaysia"
## [49] "Ireland" "Luxembourg" "Austria"
## [52] "Peru" "Senegal" "Switzerland"
## [55] "Ghana" "Saudi Arabia" "Armenia"
## [58] "Jordan" "Mongolia" "Namibia"
## [61] "Finland" "Lebanon" "Qatar"
## [64] "Vietnam" "Russia" "Malta"
## [67] "Kuwait" "Czech Republic" "Bahamas"
## [70] "Sri Lanka" "Cayman Islands" "Bangladesh"
## [73] "Zimbabwe" "Hungary" "Latvia"
## [76] "Liechtenstein" "Venezuela" "Morocco"
## [79] "Cambodia" "Albania" "Nicaragua"
## [82] "Greece" "Croatia" "Guatemala"
## [85] "West Germany" "Slovenia" "Dominican Republic"
## [88] "Nepal" "Samoa" "Bermuda"
## [91] "Ecuador" "Georgia" "Botswana"
## [94] "Iraq" "Vatican City" "Angola"
## [97] "Jamaica" "Kazakhstan" "Malawi"
## [100] "Slovakia" "Lithuania" "Afghanistan"
## [103] "Paraguay" "Somalia" "Sudan"
## [106] "Panama" "Uganda" "East Germany"
## [109] "Ukraine" "Montenegro"
length(unique(genre_country$country))
## [1] 110
# note: genre_country has one row per title, country, and genre,
# so n() counts title-genre pairs rather than distinct movies
movie_year <- genre_country %>%
  # mutate(date_added = as.numeric(date_added)) %>%
  filter(country %in% c("United States", "India", "United Kingdom", "Canada", "France")) %>%
  group_by(country, date_added) %>%
  summarise(count = n())
movie_year
## # A tibble: 45 x 3
## # Groups: country [5]
## country date_added count
## <chr> <chr> <int>
## 1 Canada 2013 1
## 2 Canada 2014 2
## 3 Canada 2015 5
## 4 Canada 2016 28
## 5 Canada 2017 116
## 6 Canada 2018 125
## 7 Canada 2019 125
## 8 Canada 2020 149
## 9 Canada 2021 10
## 10 France 2011 2
## # … with 35 more rows
#bar <- ggplot(genre_5_country, aes(fill=listed_in, y=count, x=country)) +
# geom_bar(position="fill", stat="identity")
#bar
chart <- ggplot(movie_year, aes(x = date_added, y = count, group = country, color = country)) +
geom_line() +
labs(x = "Year", y = "Count", color = "Country", title = "Number of movies by Year in Top 5 Countries") +
theme(plot.title = element_text(hjust = 0.5)) # center title
chart <- ggplotly(chart)
chart
The line graph above shows the number of movies added to Netflix by year for 5 countries. Clicking a country in the legend on the right shows or hides its line, so you can look at only the countries you want. While the numbers for all 5 countries have recently turned downward, the graph suggests that additions of Indian movies have been falling since 2018, compared with 2019 for American and British films and 2020 for French and Canadian films.
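Because `genre_country` has one row per title, country, and genre, counting its rows counts title-genre pairs rather than movies. A possible way to count each title once, sketched in base R on made-up data (`genre_country_toy` is hypothetical):

```r
# Made-up rows shaped like genre_country: one row per title, country, and genre
genre_country_toy <- data.frame(
  title      = c("A", "A", "B", "C", "C", "C"),
  country    = c("India", "India", "India", "France", "France", "France"),
  date_added = c("2019", "2019", "2019", "2020", "2020", "2020"),
  listed_in  = c("Dramas", "Comedies", "Dramas", "Dramas", "Thrillers", "Comedies"),
  stringsAsFactors = FALSE
)

# Counting rows directly counts title-genre pairs
pairs_per_year <- aggregate(title ~ country + date_added,
                            data = genre_country_toy, FUN = length)

# Dropping the genre column and de-duplicating counts each title once
titles_only <- unique(genre_country_toy[, c("title", "country", "date_added")])
titles_per_year <- aggregate(title ~ country + date_added,
                             data = titles_only, FUN = length)

print(pairs_per_year)
print(titles_per_year)
```

In the toy data India has two titles in 2019 but three title-genre pairs, so the two counts differ. Whether the trend in the line graph changes after de-duplication would need to be checked on the real data.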
This section examines Netflix movie genre trends by looking at which genres were most common in each year. The ‘listed_in’ column is used for genre and ‘date_added’ for the year a movie was added to Netflix.
The ‘listed_in’ column is pre-processed in the same way as the ‘country’ column: the genre values are split at commas. After cleaning, there are 20 genres in total.
I will create 2 bar charts: an interactive stacked bar chart for comparing changes in one genre over time, and a static percentage stacked bar chart.
head(movie$listed_in,10) # each movie has multiple genres separated by commas
## [1] "Dramas, International Movies"
## [2] "Horror Movies, International Movies"
## [3] "Action & Adventure, Independent Movies, Sci-Fi & Fantasy"
## [4] "Dramas"
## [5] "Horror Movies, International Movies"
## [6] "Dramas"
## [7] "Horror Movies, International Movies"
## [8] "Horror Movies, International Movies, Thrillers"
## [9] "Dramas, Thrillers"
## [10] "Documentaries, International Movies, Sports Movies"
movie_genre <- movie %>%
mutate(listed_in = strsplit(as.character(listed_in), ", ")) %>% # split genre column by comma
unnest(listed_in) %>%
select("title","date_added","listed_in")
movie_genre
## # A tibble: 11,546 x 3
## title date_added listed_in
## <chr> <chr> <chr>
## 1 7:19 2016 Dramas
## 2 7:19 2016 International Movies
## 3 23:59 2018 Horror Movies
## 4 23:59 2018 International Movies
## 5 9 2017 Action & Adventure
## 6 9 2017 Independent Movies
## 7 9 2017 Sci-Fi & Fantasy
## 8 21 2020 Dramas
## 9 122 2020 Horror Movies
## 10 122 2020 International Movies
## # … with 11,536 more rows
unique(movie_genre$listed_in) # all Netflix movie genres
## [1] "Dramas" "International Movies"
## [3] "Horror Movies" "Action & Adventure"
## [5] "Independent Movies" "Sci-Fi & Fantasy"
## [7] "Thrillers" "Documentaries"
## [9] "Sports Movies" "Comedies"
## [11] "Romantic Movies" "Movies"
## [13] "Music & Musicals" "LGBTQ Movies"
## [15] "Faith & Spirituality" "Children & Family Movies"
## [17] "Classic Movies" "Cult Movies"
## [19] "Stand-Up Comedy" "Anime Features"
length(unique(movie_genre$listed_in)) # total of 20 genres
## [1] 20
movie_genre <- movie_genre %>%
group_by(date_added, listed_in) %>%
summarise(count = n()) # number of movies per genre and year added
movie_genre
## # A tibble: 165 x 3
## # Groups: date_added [14]
## date_added listed_in count
## <chr> <chr> <int>
## 1 2008 Dramas 1
## 2 2008 Independent Movies 1
## 3 2008 Thrillers 1
## 4 2009 Dramas 1
## 5 2009 Horror Movies 1
## 6 2009 International Movies 1
## 7 2010 Cult Movies 1
## 8 2010 Horror Movies 1
## 9 2011 Children & Family Movies 1
## 10 2011 Dramas 13
## # … with 155 more rows
# interactive stacked bar chart
bar <- ggplot(movie_genre, aes(fill=listed_in, y=count, x=date_added)) +
geom_bar(position="stack", stat="identity") +
labs (x = "Year", fill = "Genre", title = "Movie Genre by Year") +
theme(plot.title = element_text(hjust = 0.5)) # center title
bar <- ggplotly(bar)
bar
# percent stacked bar chart
bar <- ggplot(movie_genre, aes(fill=listed_in, y=count, x=date_added)) +
geom_bar(position="fill", stat="identity") +
labs (x = "Year", y = "Percent", fill = "Genre", title = "Movie Genre by Year") +
theme(plot.title = element_text(hjust = 0.5)) # center title
bar
The first graph is a stacked bar chart of all genres by year. You can deselect all but one genre to see how it changes over time.
The second graph is the percentage stacked bar chart for Netflix movie genres. The graph shows that the variety of genres in Netflix’s movie collection has grown. In addition, the share of the international and LGBTQ movie genres (the light blue bars on the graph) has been increasing since 2014. This suggests that Netflix has been working to increase diversity in its film collection over time.
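For reference, `position = "fill"` rescales each bar so that its segments sum to 1. The same shares can be computed by hand, as in this small sketch on made-up counts (`genre_year` is illustrative):

```r
# Made-up counts per year and genre
genre_year <- data.frame(
  date_added = c("2019", "2019", "2020", "2020"),
  listed_in  = c("Dramas", "Comedies", "Dramas", "Comedies"),
  count      = c(30, 10, 20, 20)
)

# Share of each genre within its year: count divided by the yearly total
genre_year$share <- ave(genre_year$count, genre_year$date_added,
                        FUN = function(x) x / sum(x))

print(genre_year)
```

Having the shares as a column makes it possible to quote exact percentages for a genre in the text instead of reading them off the chart.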
This section studies movie genre popularity in the top 5 countries: United States, India, United Kingdom, Canada, and France. A percentage stacked bar chart and donut charts are created for visualization.
# Movie genre by percentage in top 5 countries
genre_5_country <- genre_country %>%
  filter(country %in% c("United States", "India", "United Kingdom", "Canada", "France")) %>%
  group_by(country, listed_in) %>%
  summarise(count = n())
# Percentage stacked bar chart
bar <- ggplot(genre_5_country, aes(fill=listed_in, y=count, x=country)) +
geom_bar(position="fill", stat="identity") +
labs (x = "Country", y = "Percent", fill = "Genre", title = "Movie Genre by Country") +
theme(plot.title = element_text(hjust = 0.5)) # center title
bar
# genre donut chart for each country
# https://homepage.divms.uiowa.edu/~luke/classes/STAT4580/catone.html
pie <- ggplot(genre_5_country) +
geom_col(aes(x = 1, y = count, fill = listed_in), position = "fill") +
coord_polar(theta = "y") +
labs(title = "Movie Genre", fill = "Genre") +
facet_wrap(~ country) +
theme_bw() +
theme(axis.title = element_blank(), # remove axis and grid lines
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank())
pie <- pie + xlim(0, 1.5)
pie
# donut chart for movie genre in United States
genre_usa <- genre_country %>%
filter(country == "United States") %>%
group_by(country, listed_in) %>%
summarise(count = n()) %>%
arrange(desc(count))
# donut chart
pie <- ggplot(genre_usa) +
geom_col(aes(x = 1, y = count, fill = reorder(listed_in,count)), position = "fill") +
coord_polar(theta = "y", start = 0) +
labs(title = "Movie Genre in United States", fill = "Genre") +
guides(fill = guide_legend(reverse = TRUE)) +
facet_wrap(~ country) +
theme_bw() +
theme(axis.title = element_blank(), # remove axis and grid lines
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank())
pie <- pie + xlim(0, 1.5)
pie
The first and second charts show the distribution of movie genres in the top 5 countries. Among the five countries, a very large portion of French and Indian movies are categorized as international movies on Netflix. Only a small share of American movies are categorized as international, which likely comes down to the fact that Netflix originated in the United States. In addition, France and India, two countries whose movies Netflix treats as foreign, show weaker genre diversity on Netflix than the United States. Increasing the genre diversity of foreign movies may therefore be a next step for Netflix in the coming years.
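One possible way to put numbers on these observations, sketched on made-up rows shaped like `genre_country`: the share of each country's genre rows labelled "International Movies", and the number of distinct genres per country. Both measures are assumptions chosen for illustration, not part of the original analysis.

```r
# Made-up rows: one row per title-genre pair, with the country of the title
genre_rows <- data.frame(
  country   = c("United States", "United States", "United States",
                "France", "France", "France"),
  listed_in = c("Dramas", "Comedies", "Documentaries",
                "International Movies", "International Movies", "Dramas"),
  stringsAsFactors = FALSE
)

# Row-wise proportions: each country's genre shares sum to 1
shares <- prop.table(table(genre_rows$country, genre_rows$listed_in), margin = 1)
intl_share <- shares[, "International Movies"]

# Number of distinct genres observed per country
n_genres <- tapply(genre_rows$listed_in, genre_rows$country,
                   function(x) length(unique(x)))

print(round(intl_share, 2))
print(n_genres)
```

A count of distinct genres is a crude diversity measure, since it ignores how evenly titles are spread across genres; it is used here only because it is easy to read.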